Exploiting instruction- and data-level parallelism
Authors
Abstract
Historically, computer architects have taken two different approaches to high-performance computing: instruction-level parallelism (ILP) and data-level parallelism (DLP). The ILP paradigm seeks to execute several instructions each cycle. It does this by examining a sequential instruction stream and extracting independent instructions to send to several execution units in parallel. The DLP paradigm, on the other hand, uses vectorization techniques. A vector instruction specifies a series of operations to be performed on a stream of data. The operation performed on each element is independent of all the others, so a vector instruction is highly parallel and can be easily pipelined. In this article, we propose a third approach to high-performance computing that combines the best of ILP and DLP techniques to provide an order-of-magnitude increase in performance at low complexity.
Figure 1 illustrates three microarchitecture generations in the DLP world. The first vector generation, shown in Figure 1a, introduced in-order, pipelined execution of vector instructions. This generation's prototypical machine is Cray Research's Cray-1. The second DLP generation, in Figure 1b, exploited the parallel semantics of vector instructions to implement multipipe functional units: unit replication that allows processing of more than one pair of operands per cycle. Cray Research's C90 and Nippon Electric Corp.'s SX-3 exemplified the multipipe processor. However, this generation still used the in-order execution model. Useful ILP techniques such as out-of-order execution and register renaming, which fight memory latency and improve processor throughput in the microprocessor world, have never been used in commercial vector computers. The third DLP generation, depicted in Figure 1c, is the one we are proposing in this article.
This processor merges ILP and DLP in a single-processor architecture that combines three key technologies:
• vector instructions,
• out-of-order execution with register renaming, and
• simultaneous multithreaded execution.
Vectorizable code
For many years, most scientific computing applications have largely followed the DLP model. Much of the vectorizable code optimized for yesterday's vector supercomputers runs on today's superscalar microprocessors. These codes still retain their DLP characteristics. Moreover, in recent years applications containing highly regular DLP code have multiplied. In particular, many DSP and multimedia applications (graphics, compression, encryption) are superbly suited for vector implementation. Vector instruction sets and vector architectures are an excellent match for the characteristics of data-parallel codes. Other architectures such as chip multiprocessors or multiscalar processors [2] are also good candidates to extract high performance from data-parallel code. However, vector instruction sets use fewer processor …
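The abstract's description of a vector instruction, and of the multipipe units of the second DLP generation, can be illustrated with a small sketch. This is only a toy model written for this summary, not the article's simulator: the function name `vector_add` and the `lanes` parameter (standing in for the number of replicated pipes) are assumptions, and the cycle count ignores pipeline start-up latency and memory effects.

```python
def vector_add(a, b, lanes=1):
    """Toy model of one vector add over two equal-length operand streams.

    Returns (result, cycles). `lanes` is a hypothetical count of
    replicated pipes; it is not a parameter taken from the article.
    """
    if len(a) != len(b):
        raise ValueError("operand vectors must have the same length")
    if lanes < 1:
        raise ValueError("lanes must be at least 1")

    result = [0] * len(a)
    cycles = 0
    for start in range(0, len(a), lanes):
        # No element depends on any other, so a multipipe unit could
        # process this whole group of up to `lanes` elements at once.
        for i in range(start, min(start + lanes, len(a))):
            result[i] = a[i] + b[i]
        cycles += 1  # one group of elements per modeled cycle
    return result, cycles


a = list(range(8))
b = [10 * x for x in a]
single, single_cycles = vector_add(a, b, lanes=1)
multi, multi_cycles = vector_add(a, b, lanes=4)
```

Both calls produce the same result vector; only the modeled cycle count changes with the number of lanes, which is the sense in which element independence makes replication straightforward.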
Similar resources
Exploiting Multi-Grained Parallelism for Multiple-Instruction-Stream Architectures
Exploiting parallelism is an essential part of maximizing the performance of an application on a parallel computer. Parallelism is traditionally exploited at two granularities: individual operations are executed in parallel within a processor to exploit instruction-level parallelism and loop iterations or processes are executed in parallel on different processors to exploit loop-level paralleli...
Performance Study of a Concurrent Multithreaded Processor
The performance of a concurrent multithreaded architectural model, called superthreading [15], is studied in this paper. It tries to integrate optimizing compilation techniques and run-time hardware support to exploit both thread-level and instruction-level parallelism, as opposed to exploiting only instruction-level parallelism in existing superscalars. The superthreaded architecture uses a th...
EE 382C Embedded Software Systems Project Proposal
Objective: The goal of this project is to evaluate the effectiveness of two different techniques for exploiting the instruction-level parallelism (ILP) available in digital signal processing (DSP) and multimedia applications. VLIW (Very Long Instruction Word) architectures have multiple functional units to take advantage of such parallelism, while the SIMD (Single Instruction Multiple Data) a...
The Potential of Exploiting Coarse-Grain Task Parallelism from Sequential Programs
Research into automatic extraction of instruction-level parallelism and data parallelism from sequential languages by compilers has been going on for many years. However, task parallelism has been almost unexploited by parallelizing compilers. It has been shown that coarse-grain task parallelism is a useful additional resource of parallelism for multiprocessors, but the simple and restricted ex...
Journal: IEEE Micro
Volume: 17
Publication date: 1997